Colloquium | 13/01/2021

Collaborating on Reproducible Code… !?


Collaborating:

You and your collaborators (including your
future self) can access the code and its history

Reproducible:

Your code runs and produces identical results
at different time points and on different systems

Schedule

  1. Working in different contexts: RStudio Projects
  2. Dynamic document generation: RMarkdown
  3. Version control: Git + GitHub
  4. Package management: renv
  5. Containerization: Docker
  6. Where to start?

0. Kudos


1. Working in different contexts: RStudio Projects

1. RStudio Projects - What & Why?

  • What it does:
    • Lets you work in several separate contexts (projects), e.g. one for each experiment
    • Each project has its own working directory, workspace, history, and source documents
    • Each project is associated with a folder on your computer (= working directory)
  • Why it helps:
    • Have a separate, shareable working environment for each experiment
    • Keep all the files associated with a project together — data, scripts, results, figures
    • Work on multiple projects at once, each with its own packages (and package versions), loaded data, etc.
    • Use only relative paths
    • Useful for version control
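
To make the "everything in one folder, relative paths only" idea concrete, here is a minimal shell sketch. The project name my_experiment and the file example.csv are hypothetical; the four subfolders simply mirror the file types listed above. The sketch works in a throwaway directory, so it is safe to run anywhere.

```shell
#!/bin/sh
set -e

# Work in a throwaway directory so the sketch is safe to run anywhere.
root=$(mktemp -d)
project="$root/my_experiment"   # hypothetical project name

# One subfolder per kind of file that belongs to the project.
mkdir -p "$project/data" "$project/scripts" "$project/results" "$project/figures"

# A tiny example data file (made-up content, for illustration only).
printf 'id,value\n1,0.5\n' > "$project/data/example.csv"

# From the project root, every file is reachable via a relative path.
cd "$project"
cat data/example.csv
ls
```

Because nothing refers to an absolute location, the whole folder can be moved, zipped, or cloned to another computer without breaking any path.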

1. RStudio Projects – How?

  • In RStudio: File > New Project > …

1. RStudio Projects – Version 1: Create new project

1. RStudio Projects – Version 2: Create from version control (Git)

1. RStudio Projects – Open and manage projects

1. RStudio Projects – Tricks and troubleshooting

  • Relative paths: path separator characters vary across systems, and anchor points differ depending on the context
    • Use the here-package (Müller, 2020) to define relative paths within the project: read.csv(here::here("data", "file_I_want.csv"))

2. Dynamic document generation: RMarkdown

2. RMarkdown - What & Why?

  • What it does:
    • Creates dynamic documents with embedded chunks of code (R, Python, Julia, Stan, …), computed results, written text, etc. (= LaTeX)
    • Markdown files can be exported to documents (docx, rtf), presentations, PDFs, websites (html), …, e.g. using the knitr (Xie, 2015, 2020) and tinytex (Xie, 2015, 2020; for PDFs) packages
    • R code is dynamically rendered, and can be given in separate chunks (```{r} … ```) or inline (`r …`)
  • Why it helps:
    • Simple language (≠ LaTeX)
    • Integrates directly with statistical software (RStudio)
    • Saves code AND output in one file
    • Reduces copy & paste errors: reported results consistent with actual results

2. RMarkdown - How?

  • Installation: install.packages("rmarkdown") (Allaire et al., 2017)
  • Install ‘knitr’ package for easy access: install.packages("knitr") (Xie 2015, 2020)
  • Open a markdown file (.Rmd): File > New File > R Markdown

2. RMarkdown - Tricks & troubleshooting

  • You don’t have RStudio installed: install Pandoc (http://pandoc.org) before installing rmarkdown
  • Lengthy R code chunks: Install knitr-package (Xie, 2014, 2015, 2020) to customize chunks and knitting process
    • {r cache=TRUE,message=FALSE,warning=FALSE,results="hide", error = TRUE}
    • or set the options globally with the knitr::opts_chunk$set() function
  • Knit to pdf: You need a LaTeX installation
    • TinyTeX (Xie, 2010) is a light-weight, cross-platform distribution (install.packages("tinytex"); tinytex::install_tinytex())
    • Separate code chunks by a blank line
  • Write and prepare APA journal articles: The papaja package (Aust & Barth, 2020) contains an R Markdown template for APA manuscripts, and helper functions to report results and generate tables in APA style
  • Knit older .R code files: Put #' in front of any top-level prose, including the header, or use:
#/*
rmarkdown::render(input = rstudioapi::getSourceEditorContext()$path,
                  output_format = rmarkdown::github_document(),
                  knit_root_dir = getwd()) #*/

3. Version control: Git + GitHub

3. Git + GitHub - What & Why?

  • What it does:
    • Tracks changes to files (data and code) over time: Sequence of “snapshots” (commits)
    • Lets you “go back in time”: Recall older versions or revert the entire project
    • Changes between commits can be compared
    • Organized in repositories: Collection of all snapshots
    • GitHub: Popular server for sharing materials (privately or publicly) and collaborating via git (also: GitLab and others)

3. Git + GitHub - What & Why?

  • Why it helps:
    • Keep things organized and track changes
    • Clean up code
    • Language agnostic
    • (Remote) backup
    • Work together with collaborators (even simultaneously and in parallel: branches, merges, pull requests) - and your future self
    • Web interface for your project and to track issues
    • Easily connected e.g. to the Open Science Framework (https://osf.io)

3. Git + GitHub – Installation

  • Register an account with GitHub: https://github.com/
  • (Update R, RStudio, and your packages: update.packages(ask = FALSE, checkBuilt = TRUE))
  • Is Git installed? Open your shell (“Terminal” in RStudio or on Mac, “Command Prompt” on Windows), and type: git --version. If the answer is “git: command not found”:
  • Install Git - Mac: The Mac offers to install the command line developer tools automatically. Click “Install”. If you don’t get the offer, type: xcode-select --install. Restart R.
  • Install Git - Windows: Install “Git Bash” (https://gitforwindows.org). Accept default settings. When asked about “Adjusting your PATH environment”, select “Git from the command line and also from 3rd-party software”. Restart R.
  • Configure Git: In the (Git Bash) shell, type
    • git config --global user.name 'your name'
    • git config --global user.email 'email associated with your GitHub account'
    • git config --global --list (Check whether everything worked)
  • Optional: Install a Git client. Find more info e.g. here: https://happygitwithr.com/git-client.html

3. Git + GitHub – Vocabulary

  • Vocabulary - Git:
    • Repo(sitory): Directory of files that Git manages holistically
    • Commit: Snapshot of all files in the repository, at a specific moment, each with a unique identifier (hash code or SHA) and description (commit message)
    • Diff: Set of differences between (any) two commits
    • Tag: Specific name for a certain snapshot (optional), e.g. “v1.0.3”, “preprint”, “submitted”
  • Vocabulary - GitHub
    • Push: Send your local Git commits to GitHub
    • Pull: Fetch new commits from GitHub and merge them into your local repository
    • Merge conflict: Git can’t be certain how to jointly apply diffs from two commits to their common parent. Resolve by picking manually; avoid by pulling and pushing often.
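
The Git vocabulary can be tried out in a throwaway repository. The following is a minimal sketch, not a recommended workflow: the file name notes.txt, the user identity, and the commit messages are made up for illustration, and the tag name “preprint” is taken from the examples above.

```shell
#!/bin/sh
set -e

# Repository: a directory managed by Git (here a temporary one).
repo=$(mktemp -d)
cd "$repo"
git init -q

# Identity for this throwaway repository only (hypothetical values).
git config user.name "Example User"
git config user.email "user@example.com"

# Commit: a first snapshot with a message.
echo "first line" > notes.txt
git add notes.txt
git commit -q -m "Add notes"
first=$(git rev-parse HEAD)   # the unique identifier (SHA) of this commit

# A second snapshot after changing the file.
echo "second line" >> notes.txt
git commit -q -am "Extend notes"

# Tag: a readable name for the current snapshot.
git tag preprint

# Diff: the differences between the first commit and the tagged one.
git diff "$first" preprint

# The history of snapshots, newest first.
git log --oneline
```

The same SHA or tag can later be used with git checkout to recall that exact state of the project.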

3. Git + GitHub - How?

  • Go to https://github.com/ and log in
  • Click “New repository”
    • Decide between “private” or “public”. Initialize with a README. Accept default for everything else.
    • Click “Create repository”
    • Copy the URL

3. Git + GitHub - How?

  • Clone your repository to RStudio
    • File > New Project > Select “Version Control” > Select “Git” > Enter your repository URL: https://github.com/YOUR-USERNAME/YOUR-REPOSITORY.git

3. Git + GitHub - Tricks & Troubleshooting

  • GitHub: No long-term guarantee for availability of service (is commercial)
    • Mirror snapshots on HU servers/OSF/Zenodo/FigShare/…
  • GitHub generally works better with non-proprietary (text) file formats (e.g., CSV) than with proprietary file formats (e.g., XLSX)
    • .md files are rendered as formatted pages (like HTML)
    • CSV will have a nice layout
    • README.md-files act like the landing page
    • Use internal links to refer to other files

4. Package management: renv

4. renv – What & Why?

  • What it does:
    • Creates a project-specific library of packages in the project folder
      (instead of C:/Program Files/R/R-4.0.2/library or the like)
    • Overrides install.packages() to install packages in this local library
    • Keeps track of package versions in the renv.lock file


  • Why it helps:
    • Keeps package versions untouched by other projects
    • Allows you to revert to the previous state when an update has broken your analysis
    • Makes it easier to share package versions with your collaborators (e.g., via GitHub)
    • Can also keep track of Python packages

4. renv – How?

  1. Install renv just like any other R package via install.packages("renv")
  2. Open your project and initialize your project library via renv::init()
  3. After successfully installing or updating a package, use renv::snapshot(). This will write the current version of all packages that are installed (and used) in the project to the lockfile.
  4. If you want to revert to a previous state (e.g., if an update to any of your packages has caused problems), use renv::restore()


Instead of step #2, you can also select “Use renv with this project” during project creation.

4. renv – Initializing the project library

4. renv – Example of a lockfile

4. renv – How?

Restoring someone else’s package versions:

  1. Clone or pull the repository from GitHub
  2. Open the RStudio project (e.g. via the projectname.Rproj file)
  3. Use renv::restore() to install the package versions from the renv.lock file

4. renv – Troubleshooting

  • There may be some (inconsequential) warnings when switching between Mac and Windows
  • Installing and loading packages may take a while, especially if your project lives on a network drive
    (such as N:/)
  • For installing packages that are not on CRAN you can use remotes::install_github() and the like. Note, however, that at least on Windows, you may need to install additional tools for building these packages (via renv::equip() and/or from https://cran.r-project.org/bin/windows/Rtools/)

5. Containerization: Docker

5. Docker – What & Why?

  • What it does:
    • Creates a small, Linux-based container (similar to a lightweight virtual machine) on your computer
    • Makes it possible to run your scripts (or render your .Rmd files) on this virtual system
    • The recipe to build this system is stored in a Dockerfile that can be shared via GitHub


  • Why it helps:
    • Removes differences between operating systems, R versions, region and language settings etc.
    • Supports long-term reproducibility
    • Provides a starting point for cloud-based and high performance computing (HPC)
    • Pre-packaged Docker images are available for different languages (R, Python, MATLAB, LaTeX etc.)

5. Docker – How?

docker run -d  -e PASSWORD=1234 -p 8787:8787 -v /path/to/your/project:/home/rstudio/ rocker/rstudio
  • You can then access RStudio (running in the container) by opening http://localhost:8787 in your web browser (username: rstudio, password: 1234)
  • You can also build your own container by:
    • Choosing a base image from https://hub.docker.com/u/rocker (including the tidyverse, LaTeX etc.)
    • Creating a Dockerfile in your project directory, specifying additional steps to execute when building the container, e.g., install.packages("renv"); renv::restore()

5. Docker – How?

  • Example for a Dockerfile:
# This is a text file stored with the name "Dockerfile" in your project directory.

# Base image from Docker Hub, including R, RStudio, the tidyverse, and LaTeX
FROM rocker/verse:4.0.2

# Set working directory within the container
WORKDIR /home/rstudio

# Install renv
RUN R -e "remotes::install_version('renv', version = '0.12.0', repos = 'http://cran.us.r-project.org')"

# Copy the lock file
COPY renv.lock renv.lock

# Install package versions stored in the lockfile
RUN R -e "renv::consent(provided = TRUE)"
RUN R -e "renv::restore(prompt = FALSE)"
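
Once a Dockerfile like the one above sits in the project directory, building and starting the container could look roughly like the following sketch. The image name sleepstudy-analysis is hypothetical; the password and port are the ones used in the docker run example earlier. The script only calls Docker if both Docker and a Dockerfile are present, so it does nothing harmful elsewhere.

```shell
#!/bin/sh

image="sleepstudy-analysis"   # hypothetical image name, choose your own

if command -v docker >/dev/null 2>&1 && [ -f Dockerfile ]; then
    # Build the image from the Dockerfile in the current directory.
    docker build -t "$image" .

    # Start RStudio inside the container, reachable at http://localhost:8787
    # (username: rstudio, password as set below).
    docker run -d -e PASSWORD=1234 -p 8787:8787 "$image"
else
    echo "Docker or Dockerfile not found; nothing was built."
fi
```

Building can take a while the first time, because the base image has to be downloaded and all packages from the lockfile are installed.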

5. Docker – Beyond Docker

  • Some additional tools based on Docker:
    • With binder (https://mybinder.org) and Code Ocean (https://codeocean.com), you can run your analysis in the cloud; they will even create the Dockerfile for you if you don’t have your own one
    • Singularity (https://sylabs.io) is an open-source container platform that can run Docker images and which you can use on systems where you don’t have root access (e.g., on high performance clusters)


6. Where to start?

  • This wealth of tools can seem overwhelming
    • Adopting even one or two of them can help make your code more reproducible
  • An RStudio project and renv are easy to set up even for existing projects
    • And will help a lot to make sure that you can still run your code at re-submission
  • Version control is best tried out with a new (real or toy) project
    • Create an empty repository on GitHub and use it to create your RStudio project
  • Once you’ve made it this far, full computational reproducibility (by containerizing your project) is just one more step away

7. Code along

Code along: GitHub repository and R project

  • Create a GitHub repository: Go to https://github.com/ > Enter your username and password > Click “New” Repository > Settings: Create a private repository (you can make it public later), add a README file, accept default settings for everything else. Click “Create repository” > Copy the URL
  • Open RStudio and create an R project: File > New Project… > Select Version Control > Select Git > Enter the repository URL (the URL you just copied: https://github.com/USERNAME/REPOSITORY-NAME), choose a project directory name and where it should be saved. Click “Create Project”.
  • Make changes to README file, commit, and push
    • Open your README file: File > Open File…
    • Make changes to your README (“This is my first GitHub repository”)
      • (Terminal alternative: echo "YOUR-TEXT" >> README.md)
    • Commit: Open the “Git” tab (next to the Environment tab) > Click the “Commit” button > Stage your change > Enter a commit message > Click on “Commit”.
    • Push to GitHub: Click on “Push”
      • (Terminal alternative: git add -A, then git commit -m "COMMIT-MESSAGE", then git push)
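
The terminal route can also be rehearsed offline. In the sketch below, a local “bare” repository stands in for GitHub (an assumption made so the example runs without an account or network); the user identity and commit message are made up. With a real GitHub repository, only the clone URL changes.

```shell
#!/bin/sh
set -e

work=$(mktemp -d)

# A bare repository plays the role of the remote on GitHub (stand-in only).
git init -q --bare "$work/remote.git"

# Clone it, as RStudio does when creating a project from version control.
git clone -q "$work/remote.git" "$work/project"
cd "$work/project"

# Identity for this throwaway clone only (hypothetical values).
git config user.name "Example User"
git config user.email "user@example.com"

# Change the README, stage everything, commit, and push.
echo "This is my first GitHub repository" >> README.md
git add -A
git commit -q -m "Describe the repository in the README"
git push -q origin HEAD

# The remote now knows about the new commit.
git ls-remote "$work/remote.git"
```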

Code along: RMarkdown

  • Install RMarkdown: Type to console install.packages("rmarkdown") and install.packages("knitr")
  • Create an RMarkdown file: File > New File > R Markdown… > Select “Document” > Choose a title (e.g. “analyses_sleepstudy”) and HTML as default output format. Click “OK”.
  • Make changes to your Markdown file: Enter a first section title (# Random number generation), a text (“Here, we generate 10 random numbers between 0 and 10”), and an R code chunk (```{r} sample(0:10, 10, replace = FALSE) ```).
  • Knit your Markdown file: Knit button
  • Commit and push your change: Click on “Commit” > Stage files > Enter commit message > Click on “Commit” > Click on “Push”

4. renv – Code along

  • Initialize renv for your sleepstudy project using renv::init()
  • From the “Files” pane in RStudio, take a look at the renv.lock file
  • Install a new package:
install.packages("cowsay")
  • Actually use the package in one of your scripts (or .Rmd files):
cowsay::say("Hello world", "cow")
  • Write this change to the lockfile using renv::snapshot()
  • Commit and push your changes to GitHub

Thank you.